IEICE global.ieice.org Site

Author Search Result

[Author] Tatsuo OHTSUKI(49hit)

21-40hit(49hit)

An Algorithm and a Flexible Architecture for Fast Block-Matching Motion Estimation
Jinku CHOI Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-VLSI Design

Vol:
E85-A No:12
Page(s):
2603-2611
The motion estimation can choose the most suitable algorithm for different kinds of motion types, formats, and characteristics. The video encoding system can be optimized for quality, speed, and power consumption. In this paper, we propose a reconfigurable approach to a motion estimation algorithm and hardware architecture. The proposed algorithm determines motion type and then selects adapted block-matching algorithm for different kinds of motion sequences. The quality of our algorithm is better than that of the TSS and the BBGDS algorithm, or comparable to the performance of the better of the two, and the computational complexity of our algorithm is significantly less than that of the TSS. We also propose hardware architecture for realizing two kinds of motion estimations in the same hardware. We implemented the flexible and reconfigurable hardware architecture by using address generator unit, delay unit, and parameters and by using the hardware description language (VHDL) and the SYNOPSYS synthesis design tools. We analyze the performance of the algorithm and present adapted algorithm for a low cost real time application.
A Secure Test Technique for Pipelined Advanced Encryption Standard
Youhua SHI Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

LETTER

Vol:
E91-D No:3
Page(s):
776-780
In this paper, we presented a Design-for-Secure-Test (DFST) technique for pipelined AES to guarantee both the security and the test quality during testing. Unlike previous works, the proposed method can keep all the secrets inside and provide high test quality and fault diagnosis ability as well. Furthermore, the proposed DFST technique can significantly reduce test application time, test data volume, and test generation effort as additional benefits.
A High-Level Synthesis System for Digital Signal Processing Based on Data-Flow Graph Enumeration
Nozomu TOGAWA Takafumi HISAKI Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-High-level Synthesis

Vol:
E81-A No:12
Page(s):
2563-2575
This paper proposes a high-level synthesis system for datapath design of digital processing hardwares. The system consists of four phases: (1) DFG (data-flow graph) generation, (2) scheduling, (3) resource binding, and (4) HDL (hardware description language) generation. In (1), the system does not generate only one best DFG representing a given behavioral description of a hardware, but more than one good DFGs representing it. In (2) and (3), several synthesis tools can be incorporated into the system depending on the required objectives. Thus we can obtain more than one datapath candidates for a behavioral description with their area and performance evaluation. In (4), the best datapath design is selected among those candidates and its hardware description is generated. The experimental results for applying the system to several benchmarks show the effectiveness and efficiency.
A Hardware/Software Cosynthesis System for Digital Signal Processor Cores with Two Types of Register Files
Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER

Vol:
E83-A No:3
Page(s):
442-451
In digital signal processing, bit width of intermediate variables should be longer than that of input and output variables in order to execute intermediate operations with high precision. Then a processor core for digital signal processing is required to have two types of register files, one of which is used by input and output variables and the other one is used by intermediate variables. This paper proposes a hardware/software cosynthesis system for digital signal processor cores with two types of register files. Given an application program and its data, the system synthesizes a hardware description of a processor core, an object code running on the processor core, and software environments. A synthesized processor core can be composed of a processor kernel, multiple data memory buses, hardware loop units, addressing units, and multiple functional units. Furthermore it can have two types of register files RF1 and RF2. The bit width and number of registers in RF1 or RF2 will be determined based on a given application program. Thus a synthesized processor core will have small area with keeping high precision of intermediate operations compared with a processor core with only one register file. The experimental results demonstrate the effectiveness of the proposed system.
A Hardware/Software Partitioning Algorithm for Processor Cores with Packed SIMD-Type Instructions
Nozomu TOGAWA Koichi TACHIKAKE Yuichiro MIYAOKA Masao YANAGISAWA Tatsuo OHTSUKI

LETTER-Design Methodology

Vol:
E86-A No:12
Page(s):
3218-3224
This letter proposes a new hardware/software partitioning algorithm for processor cores with SIMD instructions. Given a compiled assembly code including SIMD instructions and a timing constraint, the proposed algorithm synthesizes an area-optimized processor core with a new assembly code. Firstly, we assume for each operation type a super SIMD functional unit which can execute all the SIMD instructions. Secondly we reduce a SIMD instruction or "sub-function" of each super functional unit, one by one, while the timing constraint is satisfied. At the same time, we update the assembly code so that it can run on the new processor configuration. By repeating this process, we finally find SIMD functional unit configuration as well as a processor core architecture. The promising experimental results are also shown.
Placement, Routing, and Compaction Algorithms for Analog Circuits
Imbaby I. MAHMOUD Toru AWASHIMA Koji ASAKURA Tatsuo OHTSUKI

PAPER-Algorithms for VLSI Design

Vol:
E76-A No:6
Page(s):
894-903
The performance of analog circuits is strongly influenced by their layout. Performance specifications are usually translated into physical constraints such as symmetry, common orientation, and distance constraints among certain components. Automatic digital layout tools can be adopted and modified to deal with the imposed performance constraints on the analog layout. The selection and modifications of algorithms to handle the analog constraints became the area of research in analog layout systems. The existing systems are characterized by the use of stochastic optimization techniques based placement, grid based or channel routers, and lack of compaction. In this paper, algorithms for analog circuit placement, routing, and compaction are presented. The proposed algorithms consider the analog oriented constraints, which are important from an analog layout point of view, and reduce the computation cost. The placement algorithm is based on a force directed method and consists of two main phases, each of which includes a tuning procedure. In the first phase, we solve a set of simultaneous linear equations, based upon the attractive forces. These attractive forces represent the interconnection topology of given blocks and some specified constraints. Symmetry constraint is considered throughout the tuning procedure. In the second phase, block overlap resulting from the first phase is resolved iteratively, where each iteration is followed by the symmetry tuning procedure. Routing is performed using a line expansion based gridless router. Routing constraints are taken into account and several routing priorities are imposed on the nets. The compactor part employs a constraint graph based algorithm while considering the analog symmetry constraints. The algorithms are implemented and integrated within an analog layout design system. An experimental result for an OP AMP provided by MCNC benchmark is shown to demonstrate the performance of the algorithms.
A Hardware/Software Cosynthesis System for Processor Cores with Content Addressable Memories
Nozomu TOGAWA Takao TOTSUKA Tatsuhiko WAKUI Masao YANAGISAWA Tatsuo OHTSUKI

PAPER

Vol:
E86-A No:5
Page(s):
1082-1092
Content addressable memory (CAM) is one of the functional memories which realize word-parallel equivalence search. Since a CAM unit is generally used in a particular application program, we consider that appropriate design for CAM units is required depending on the requirements for the application program. This paper proposes a hardware/software cosynthesis system for CAM processors. The input of the system is an application program written in C including CAM functions and a constraint for execution time (or CAM processor area). Its output is hardware descriptions of a synthesized processor and a binary code executed on it. Based on the branch-and-bound method, the system determines which CAM function is realized by a hardware and which CAM function is realized by a software with meeting the given timing constraint (or area constraint) and minimizing the CAM processor area (or execution time of the application program). We expect that we can realize optimal CAM processor design for an application program. Experimental results for several application programs show that we can obtain a CAM processor whose area is minimum with meeting the given timing constraint.
An Area/Time Optimizing Algorithm in High-Level Synthesis of Control-Based Hardwares
Nozomu TOGAWA Masayuki IENAGA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER

Vol:
E84-A No:5
Page(s):
1166-1176
This paper proposes an area/time optimizing algorithm in a high-level synthesis system for control-based hardwares. Given a call graph whose node corresponds to a control flow of an application program, the algorithm generates a set of state-transition graphs which represents the input call graph under area and timing constraint. In the algorithm, first state-transition graphs which satisfy only timing constraint are generated and second they are transformed so that they can satisfy area constraint. Since the algorithm is directly applied to control-flow graphs, it can deal with control flows such as bit-wise processes and conditional branches. Further, the algorithm synthesizes more than one hardware architecture candidates from a single call graph for an application program. Designers of an application program can select several good hardware architectures among candidates depending on multiple design criteria. Experimental results for several control-based hardwares demonstrate effectiveness and efficiency of the algorithm.
A Hardware/Software Cosynthesis System for Digital Signal Processor Cores
Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER

Vol:
E82-A No:11
Page(s):
2325-2337
This paper proposes a hardware/software cosynthesis system for digital signal processor cores and a hardware/software partitioning algorithm which is one of the key issues for the system. The target processor has a VLIW-type core which can be composed of a processor kernel, multiple data memory buses (X-bus and Y-bus), hardware loop units, addressing units, and multiple functional units. The processor kernel includes five pipeline stages (RISC-type kernel) or three pipeline stages (DSP-type kernel). Given an application program written in the C language and a set of application data, the system synthesizes a processor core by selecting an appropriate kernel (RISC-type or DSP-type kernel) and required hardware units according to the application program/data and the hardware costs. The system also generates the object code for the application program and a software environment (compiler and simulator) for the processor core. The experimental results demonstrate that the system synthesizes processor cores effectively according to the features of an application program and the synthesized processor cores execute most application programs with the minimum number of clock cycles compared with several existing processors.
Floorplan-Aware High-Level Synthesis for Generalized Distributed-Register Architectures
Akira OHCHI Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-High-Level Synthesis and System-Level Design

Vol:
E92-A No:12
Page(s):
3169-3179
As device feature size decreases, interconnection delay becomes the dominating factor of circuit total delay. Distributed-register architectures can reduce the influence of interconnection delay. They may, however, increase circuit area because they require many local registers. Moreover original distributed-register architectures do not consider control signal delay, which may be the bottleneck in a circuit. In this paper, we propose a high-level synthesis method targeting generalized distributed-register architecture in which we introduce shared/local registers and global/local controllers. Our method is based on iterative improvement of scheduling/binding and floorplanning. First, we prepare shared-register groups with global controllers, each of which corresponds to a single functional unit. As iterations proceed, we use local registers and local controllers for functional units on a critical path. Shared-register groups physically located close to each other are merged into a single group. Accordingly, global controllers are merged. Finally, our method obtains a generalized distributed-register architecture where its scheduling/binding as well as floorplanning are simultaneously optimized. Experimental results show that the area is decreased by 4.7% while maintaining the performance of the circuit equal with that using original distributed-register architectures.
A SIMD Instruction Set and Functional Unit Synthesis Algorithm with SIMD Operation Decomposition
Nozomu TOGAWA Koichi TACHIKAKE Yuichiro MIYAOKA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-Programmable Logic, VLSI, CAD and Layout

Vol:
E88-D No:7
Page(s):
1340-1349
This paper focuses on SIMD processor synthesis and proposes a SIMD instruction set/functional unit synthesis algorithm. Given an initial assembly code and a timing constraint, the proposed algorithm synthesizes an area-optimized processor core with optimal SIMD functional units. It also synthesizes a SIMD instruction set. The input initial assembly code is assumed to run on a full-resource SIMD processor (virtual processor) which has all the possible SIMD functional units. In our algorithm, we introduce the SIMD operation decomposition and apply it to the initial assembly code and the full-resource SIMD processor. By gradually reducing SIMD operations or decomposing SIMD operations, we can finally find a processor core with small area under the given timing constraint. The promising experimental results are also shown.
A Hardware/Software Cosynthesis Algorithm for Processors with Heterogeneous Datapaths
Yuichiro MIYAOKA Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER

Vol:
E87-A No:4
Page(s):
830-836
This paper proposes a hardware/software cosynthesis algorithm for processors with heterogeneous registers. Given a CDFG corresponding to an application program and a timing constraint, the algorithm generates a processor configuration minimizing area of the processor and an assembly code on the processor. First, the algorithm configures a datapath which can execute several DFG nodes with data dependency at one cycle. The datapath can execute the application program at the least number of cycles. The branch and bound algorithm is applied and all the number of functional units and memory banks are tried. For an assumed number of functional units and memory banks, an appropriate number of heterogeneous registers and connections to functional units and registers are explored. The experimental results show effectiveness and efficiency of the algorithm.
Selective Low-Care Coding: A Means for Test Data Compression in Circuits with Multiple Scan Chains
Youhua SHI Nozomu TOGAWA Shinji KIMURA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER

Vol:
E89-A No:4
Page(s):
996-1004
This paper presents a test input data compression technique, Selective Low-Care Coding (SLC), which can be used to significantly reduce input test data volume as well as the external test channel requirement for multiscan-based designs. In the proposed SLC scheme, we explored the linear dependencies of the internal scan chains, and instead of encoding all the specified bits in test cubes, only a smaller amount of specified bits are selected for encoding, thus greater compression can be expected. Experiments on the larger benchmark circuits show drastic reduction in test data volume with corresponding savings on test application time can be indeed achieved even for the well-compacted test set.
A performance-Oriented Simultaneous Placement and Global Routing Algorithm for Transport-Processing FPGAs
Nozomu TOGAWA Masao SATO Tatsuo OHTSUKI

PAPER

Vol:
E80-A No:10
Page(s):
1795-1806
In layout design of transport-processing FPGAs, it is required that not only routing congestion is kept small but also circuits implemented on them operate with higher operation frequency. This paper extends the proposed simultaneous placement and global routing algorithm for transport-processing FPGAs whose objective is to minimize routing congestion and proposes a new algorithm in which the length of each critical signal path (path length) is limited within a specified upper bound imposed on it (path length constraint). The algorithm is based on hierarchical bipartitioning of layout regions and LUT (Look Up Table) sets to be placed. In each bipartitioning, the algorithm first searches the paths with tighter path length constraints by estimating their path lengths. Second the algorithm proceeds the bipartitioning so that the path lengths of critical paths can be reduced. The algorithm is applied to transport-processing circuits and compared with conventional approaches. The results demonstrate that the algorithm satisfies the path length constraints for 11 out of 13 circuits, though it increases routing congestion by an average of 20%. After detailed routing, it achieves 100% routing for all the circuits and decreases a circuit delay by an average of 23%.
FOREWORD
Shuji TSUKIYAMA Tatsuo OHTSUKI

FOREWORD

Vol:
E76-A No:3
Page(s):
257-258
A Scan-Based Attack Based on Discriminators for AES Cryptosystems
Ryuta NARA Nozomu TOGAWA Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-Embedded, Real-Time and Reconfigurable Systems

Vol:
E92-A No:12
Page(s):
3229-3237
A scan chain is one of the most important testing techniques, but it can be used as side-channel attacks against a cryptography LSI. We focus on scan-based attacks, in which scan chains are targeted for side-channel attacks. The conventional scan-based attacks only consider the scan chain composed of only the registers in a cryptography circuit. However, a cryptography LSI usually uses many circuits such as memories, micro processors and other circuits. This means that the conventional attacks cannot be applied to the practical scan chain composed of various types of registers. In this paper, a scan-based attack which enables to decipher the secret key in an AES cryptography LSI composed of an AES circuit and other circuits is proposed. By focusing on bit pattern of the specific register and monitoring its change, our scan-based attack eliminates the influence of registers included in other circuits than AES. Our attack does not depend on scan chain architecture, and it can decipher practical AES cryptography LSIs.
Fast Scheduling and Allocation Algorithms for Entropy CODEC
Katsuharu SUZUKI Nozomu TOGAWA Masao SATO Tatsuo OHTSUKI

PAPER-High Level Synthesis

Vol:
E80-D No:10
Page(s):
982-992
Entropy coding/decoding are implemented on FPGAs as a fast and flexible system in which high-level synthesis technologies are key issues. In this paper, we propose scheduling and allocation algorithms for behavioral descriptions of entropy CODEC. The scheduling algorithm employs a control-flow graph as input and finds a solution with minimal hardware cost and execution time by merging nodes in the control-flow graph. The allocation algorithm assigns operations to operators with various bit lengths. As a result, register-transfer level descriptions are efficiently obtained from behavioral descriptions of entropy CODEC with complicated control flow and variable bit lengths. Experimental results demonstrate that our algorithms synthesize the same circuits as manually designed within one second.
A Retargetable Simulator Generator for DSP Processor Cores with Packed SIMD-type Instructions
Nozomu TOGAWA Kyosuke KASAHARA Yuichiro MIYAOKA Jinku CHOI Masao YANAGISAWA Tatsuo OHTSUKI

PAPER-Simulation Accelerator

Vol:
E86-A No:12
Page(s):
3099-3109
A packed SIMD type operation or a SIMD operation is n-parallel b/n-bit sub-operations executed by the modified n-bit functional unit. Such a functional unit is called a SIMD functional unit and a processor core which can execute SIMD operations is called a SIMD processor core. SIMD operations can be effectively applied to image processing applications. This paper focuses on hardware/software cosynthesis of SIMD processor cores and particularly proposes a new simulator generator which simulates pipelined instructions for a SIMD processor. Generally, a SIMD functional unit has many options and then we can have so many different SIMD functional unit instances. However, since our hardware/software cosynthesis system synthesizes a special-purpose processor core for an input application program, it uses very limited SIMD functional unit instances. In the proposed approach, we consider a SIMD operation to be a set of SIMD sub-operations. By adding up the appropriate SIMD sub-operations, we construct a single SIMD operation. Then a SIMD functional unit behavior can be characterized by a collection of SIMD operations. This approach has the advantage that: if we have a small number of behavior libraries for SIMD sub-operations, we can instantiate a particular SIMD functional unit behavior. Experimental results demonstrate the effectiveness of the proposed approach.
A Circuit Partitioning Algorithm with Replication Capability for Multi-FPGA Systems
Nozomu TOGAWA Masao SATO Tatsuo OHTSUKI

PAPER

Vol:
E78-A No:12
Page(s):
1765-1776
In circuit partitioning for FPGAs, partitioned signal nets are connected using I/O blocks, through which signals are coming from or going to external pins. However, the number of I/O blocks per chip is relatively small compared with the number of logic-blocks, which realize logic functions, accommodated in the FPGA chip. Because of the I/O block limitation, the size of a circuit implemented on each FPGA chip is usually small, which leads to a serious decrease of logic-block utilization. It is required to utilize unused logic-blocks in terms of reducing the number of I/O blocks and realize circuits on given FPGA chips. In this paper, we propose an algorithm which partitions an initial circuit into multi-FPGA chips. The algorithm is based on recursive bi-partitioning of a circuit. In each bi-partitioning, it searches a partitioning position of a circuit such that each of partitioned subcircuits is accommodated in each FPGA chip with making the number of signal nets between chips as small as possible. Such bi-partitioning is achieved by computing a minimum cut repeatedly applying a network flow technique, and replicating logic-blocks appropriately. Since a set of logic-blocks assigned to each chip is computed separately, logic-blocks to be replicated are naturally determined. This means that the algorithm makes good use of unused logic-blocks from the viewpoint of reducing the number of signal nets between chips, i.e. the number of required I/O blocks. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it decreases the maximum number of I/O blocks per chip by a maximum of 49% compared with conventional algorithms.
A Circuit Partitioning Algorithm with Path Delay Constraints for Multi-FPGA Systems
Nozomu TOGAWA Masao SATO Tatsuo OHTSUKI

PAPER

Vol:
E80-A No:3
Page(s):
494-505
In this paper, we extend the circuit partitioning algorithm which we have proposed for multi-FPGA systems and present a new algorithm in which the delay of each critical signal path is within a specified upper bound imposed on it. The core of the presented algorithm is recursive bipartitioning of a circuit. The bipartitioning procedure consists of three stages: 0) detection of critical paths; 1) bipartitioning of a set of primary inputs and outputs; and 2) bipartitioning of a set of logic-blocks. In 0), the algorithm computes the lower bounds of delays for paths with path delay constraints and detects the critical paths based on the difference between the lower and upper bound dynamically in every bipartitioning procedure. The delays of the critical paths are reduced with higher priority. In 1), the algorithm attempts to assign the primary inputs and outputs on each critical path to one chip so that the critical path does not cross between chips. Finally in 2), the algorithm not only decreases the number of crossings between chips but also assigns the logic-blocks on each critical path to one chip by exploiting a network flow technique. The algorithm has been implemented and applied to MCNC PARTITIONING 93 benchmark circuits. The experimental results demonstrate that it resolves almost all path delay constraints with maintaining the maximum number of required I/O blocks per chip small compared with conventional alogorithms.

21-40hit(49hit)

Author Search Result

[Author] Tatsuo OHTSUKI(49hit)

An Algorithm and a Flexible Architecture for Fast Block-Matching Motion Estimation

A Secure Test Technique for Pipelined Advanced Encryption Standard

A High-Level Synthesis System for Digital Signal Processing Based on Data-Flow Graph Enumeration

A Hardware/Software Cosynthesis System for Digital Signal Processor Cores with Two Types of Register Files

A Hardware/Software Partitioning Algorithm for Processor Cores with Packed SIMD-Type Instructions

Placement, Routing, and Compaction Algorithms for Analog Circuits

A Hardware/Software Cosynthesis System for Processor Cores with Content Addressable Memories

An Area/Time Optimizing Algorithm in High-Level Synthesis of Control-Based Hardwares

A Hardware/Software Cosynthesis System for Digital Signal Processor Cores

Floorplan-Aware High-Level Synthesis for Generalized Distributed-Register Architectures

A SIMD Instruction Set and Functional Unit Synthesis Algorithm with SIMD Operation Decomposition

A Hardware/Software Cosynthesis Algorithm for Processors with Heterogeneous Datapaths

Selective Low-Care Coding: A Means for Test Data Compression in Circuits with Multiple Scan Chains

A performance-Oriented Simultaneous Placement and Global Routing Algorithm for Transport-Processing FPGAs

FOREWORD

A Scan-Based Attack Based on Discriminators for AES Cryptosystems

Fast Scheduling and Allocation Algorithms for Entropy CODEC

A Retargetable Simulator Generator for DSP Processor Cores with Packed SIMD-type Instructions

A Circuit Partitioning Algorithm with Replication Capability for Multi-FPGA Systems

A Circuit Partitioning Algorithm with Path Delay Constraints for Multi-FPGA Systems

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles